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Abstract 



The sum-product or belief propagation (BP) algorithm is a widely-used message- 
passing algorithm for computing marginal distributions in graphical models with discrete 
variables. At the core of the BP message updates, when applied to a graphical model 
with pairwisc interactions, lies a matrix-vector product with complexity that is quadratic 
in the state dimension d, and requires transmission of a (d — l)-dimensional vector of real 
numbers (messages) to its neighbors. Since various applications involve very large state 
dimensions, such computation and communication complexities can be prohibitively com- 
plex. In this paper, we propose a low-complexity variant of BP, referred to as stochastic 
belief propagation (SBP). As suggested by the name, it is an adaptively randomized ver- 
sion of the BP message updates in which each node passes randomly chosen information 
to each of its neighbors. The SBP message updates reduce the computational complexity 
(per iteration) from quadratic to linear in d, without assuming any particular structure of 
the potentials, and also reduce the communication complexity significantly, requiring only 
log d bits transmission per edge. Moreover, we establish a number of theoretical guaran- 
tees for the performance of SBP, showing that it converges almost surely to the BP fixed 
point for any tree-structured graph, and for graphs with cycles satisfying a contractiv- 
ity condition. In addition, for these graphical models, we provide non-asymptotic upper 
bounds on the convergence rate, showing that the loa norm of the error vector decays 
no slower than 0(l/i/i) with the number of iterations t on trees and the mean square 
error decays as 0(l/t) for general graphs. These analysis show that SBP can provably 
yield reductions in computational and communication complexities for various classes of 
graphical models0 

Keywords: Graphical models; sum-product algorithm; low-complexity belief propagation; 
randomized algorithm. 

1 Introduction 

Graphical models provide a general framework for describing statistical interactions among 
large collections of random variables. A broad range of fields — among them statistical sig- 
nal processing, computer vision, coding and information theory, and bioinformatics — involve 

1 Portions of the results given here were initially reported at the Allerton Conference on Communications, 
Control, and Computing (September 2011). 
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problems that can be fruitfully tackled using the formalism of graphical models. A computa- 
tional problem central to such applications is that of marginalization, meaning the problem 
of computing marginal distributions over a subset of random variables. Naively approached, 
these marginalization problems have exponential complexity, and hence are computationally 
intractable. Therefore, graphical models are only useful when combined with efficient algo- 
rithms. For graphs without cycles, the marginalization problem can be solved exactly and 
efficiently via an algorithm known as the sum-product or belief propagation (BP) algorithm. 
It is a distributed algorithm, in which each node performs a set of local computations, and 
then relays the results to its graph neighbors in the form of so-called messages. For graphs 
with cycles, BP is no longer an exact method, but nonetheless is widely used and known to 
be extremely effective in many settings. For a more detailed discussion of the role of the 
marginalization problem and the use of sum-product, we refer the reader to various overview 
papers (e.g., pH M EES E] ) . 

In many applications of BP, the messages themselves are high-dimensional in nature, ei- 
ther due to discrete random variables with a very large number of possible realizations d, 
which will be reffered to as the number of states, factor nodes with high degree, or continuous 
random variables that are discretized. Examples of such problems include disparity estima- 
tion in computer vision, tracking problems in sensor networks, and error-control decoding. 
For such problems, it may be expensive to compute and/or store the messages, and as a 
consequence, BP may run slowly, and be limited to small-scale instances. Motivated by this 
challenge, researchers have studied a variety of techniques to reduce complexity of BP in 
different applications (e.g., see the papers [HI EH HSJ EJ EE El [26] and references therein). 
At the core of sum-product message-passing is a matrix-vector multiplication, with complex- 
ity scaling quadratically in the number of states d. Certain graphical models have special 
structure that can be exploited so as to reduce this complexity. For instance, in applica- 
tion to the decoding of low-density parity check codes in channel coding (e.g., [I0l Ej), the 
complexity of message-passing, if performed naively, would scale exponentially in the factor 
degrees. However, a clever use of the fast Fourier transform over GF(2) reduces this com- 
plexity to linear in the factor degrees [25]. Other problems arising in computer vision involve 
pairwise factors with a circulant structure for which the fast Fourier transform can also re- 
duce complexity [9j. Similarly, computation can be accelerated by exploiting symmetry in 
factors [15] , or additional factorization properties of the distribution [19] . In the absence of 
structure to exploit, other researchers have proposed different types of quantization strategies 
for BP message updates [H [H], as well as stochastic methods based on particle filtering or 
non-parametric belief propagation (e.g., (3JE71E]) that approximate continuous messages by 
finite numbers of particles. For certain classes of these methods, it is possible to establish 
consistency as the number of particles tends to infinity [7] or establish finite-length results 
inversely proportional to the square root of the number of particles [13J. As the number of 
particles diverges, the approximation error becomes negligible, a property that underlies such 
consistency proofs. Researchers have also proposed stochastic techniques to improve the de- 
coding efficiency of binary error-correcting codes [30|. 121]. These techniques, which are based 
on encoding messages with sequences of Bernoulli random variables, lead to efficient decoding 
hardware architectures. 

In this paper, we focus on the problem of implementing BP in high-dimensional discrete 
spaces, and propose a novel low-complexity algorithm, which we refer to as stochastic belief 
propagation (SBP). As suggested by its name, it is an adaptively randomized version of the BP 
algorithm, where each node only passes randomly selected partial information to its neighbors 
at each round. The SBP algorithm has two features that makes it practically appealing. 
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First, it reduces the computational cost of BP by an order of magnitude; in concrete terms, 
for arbitrary pairwise potentials over d states, it reduces the per iteration computational 
complexity from quadratic to linear — that is, from 0(<i 2 ) to 0(d). Second, it significantly 
reduces the message/communication complexity, requiring transmission of only logd bits per 
edge as opposed to (d — 1) real numbers in the case of BP. 

Even though SBP is based on low-complexity updates, we are able to establish conditions 
under which it converges (in a stochastic sense) to the exact BP fixed point, and moreover, 
to establish quantitative bounds on this rate of convergence. These bounds show that SBP 
can yield provable reductions in the complexity of computing a BP fixed point to a tolerance 
5 > 0. In more precise terms, we first show that SBP is strongly consistent on any tree- 
structured graph, meaning that it converges almost surely to the unique BP fixed point; in 
addition, we provide non-asymptotic upper bounds on the £oo norm (maximum value) of the 
error vector as a function of iteration number (Theorem [T]) . For general graphs with cycles, 
we show that when the ordinary BP message updates satisfy a type of contraction condition, 
then the SBP message updates are strongly consistent, and converge in mean-squared error 
at the rate 0(l/t) to the unique BP fixed point, where t is the number of iterations. We 
also show that the typical performance is sharply concentrated around its mean (Theorem [2]) . 
These theoretical results are supported by simulation studies, showing the convergence of the 
algorithm on various graphs, and the associated reduction in computational complexity that 
is possible. 

The remainder of the paper is organized as follows. We begin in Section[2]with background 
on graphical models as well as the BP algorithm. In Section [3l we provide a precise description 
of the SBP, before turning in Section 13.21 to statements of our main theoretical results, as 
well as discussion of some of their consequences. Section 0] is devoted to the proofs of our 
results, with more technical aspects of the proofs deferred to the Appendices. In Section [5l 
we demonstrate the correspondence between our theoretical predictions and the algorithm's 
practical behavior. 

2 Background 

In this section, we provide some background on graphical models as well as the sum-product 
or belief propagation algorithm. 

2.1 Graphical Models 

Consider a random vector X := {X\, X%, . ■ ■ , X n }, where for each u = 1,2, ... ,n, the variable 
X u takes values in some discrete space X := {1, 2, . . . , d} with cardinality d. An undirected 
graphical model, also known as a Markov random field, defines a family of joint probability 
distributions over this random vector by associating the index set {1,2, ... ,n} with the vertex 
set V of an undirected graph Q = (V,£). In addition to the vertex set, the graph consists 
of a collection of edges £ C V x V, where a pair (u, v) £ £ if and only if nodes u and v 
are connected by an edge. The structure of the graph describes the statistical dependencies 
among the different random variables — in particular, via the cliques^ of the graph. For each 
clique / of the graph, let ipi : X^ 1 ^ — > (0, 00) be a function of the sub- vector Xj := { X u , u £ 1} 
of random variables indexed by the clique, and then consider the set of all distributions over 

2 A clique / of a graph is a subset of vertices that are all joined by edges, and so form a fully connected 
subgraph. 
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X that factorize as 
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where C is the set of all cliques in the graph. 

As a concrete example, consider the two-dimensional grid shown in Figure QJa). Since its 
cliques consist of the set of all vertices V together with the set of all edges £ , the general 
factorization (pQ) takes the special form 
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where ip u : X — > (0, oo) is the node potential function for node u, and ip uv : X x X — > (0, oo) 
is the edge potential function for the edge (u,v). A factorization of this form ([2]) is known 
as a pairwise Markov random field. It is important to note that there is no loss of generality 
in assuming a pairwise factorization of this form; indeed, any graphical model with discrete 
random variables can be converted into a pairwise form by suitably augmenting the state 
space (e.g., see Yedidia et al. [33] or Wainwright and Jordan [32] . Appendix E.3). Moreover, 
the sum-product message updates can be easily translated from the original graph to the 
pairwise graph, and vice versa. Accordingly, for the remainder of this paper, we focus on the 
case of a pairwise MRF. 
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Figure 1. Examples of pairwise Markov random fields, (a) A two-dimensional grid: potential 
functions ipu an d tpv are associated with nodes u and v respectively, whereas potential function 
4>uv is associated with edge (u, v). (b) Markov chain model including both hidden variables 
(afi, . . . , Xg), represented as white nodes, and observed variables (yi, . ,ys) represented as 
shaded nodes. 

In various application contexts, the random vector (Xi, . . . , X n ) is an unobserved or "hid- 
den" quantity, and the goal is to draw inferences on the basis of a collection of observations 
(Y\, . . . ,Y n ). The link between the observed and hidden variables is specified in terms of a 
conditional probability distribution, which in many cases can be written in the product form 
P(y | x) = n"=i ^(Uu I x u)- For instance, in error-control coding using a low-density parity 
check code, the vector X takes values in a linear subspace of {0, l} n , corresponding to valid 
codewords, and the observation vector Y is obtained from some form of memoryless channel 



4 



(e.g., binary symmetric, additive white Gaussian noise, etc.). In image denoising applications, 
the vector X represents a rasterized form of the image, and the observation Y corresponds to 
a corrupted form of the image. 

In terms of drawing conclusions about the hidden variables based on the observations, 
the central object is the posterior distribution P(x | y). From the definition of conditional 
probability and the form of the prior and likelihoods, this posterior can also be factorized in 
pairwise form 

n 

F(x | y) oc F(xi,...,x n ) ¥(y u \ x u ) = i> u (.Zu) ] { ^uv{x u ,x v ), (3) 

u=l u&V (u,v)e£ 

where ipu(%u) '■= ^u{xuW{yu \ x u ) is the new node compatibility function. (Since the obser- 
vation y u is fixed, there is no need to track its functional dependence.) Thus, the problem 
of computing marginals for a posterior distribution can be caslH as an instance of computing 
marginals for a pairwise Markov random field ([2]). 

Our focus in this paper is the marginalization problem, meaning the computation of the 
single-node marginal distributions 

P( x „) : = ^ P 04, • • • , x' n ) for each u G V, (4) 

{x'\x' u =x u } 

and more generally, higher-order marginal distributions on edges and cliques. Note that 
to calculate this summation, brute force is not tractable and requires nd n ~ 1 computations. 
For any graph without cycles — known as a tree — this computation can be carried far more 
efficiently in only 0(nd 2 ) operations using an algorithm known as the beilef propagation 
algorithm, to which we now turn. 

2.2 Sum-product Algorithm 

Belief propagation, also known as the sum-product algorithm, is an iterative algorithm con- 
sisting of a set of local message-passing rounds, for computing either exact or approximate 
marginal distributions. For tree-structured (cycle-free) graphs, it is known that BP message 
updates converge to the exact marginals in a finite number of iterations. However, the same 
message-passing updates can also be applied to more general graphs, and are known to be 
effective for computing approximate marginals in numerous applications. Here we provide a 
very brief treatment, referring the reader to various standard sources |17|l2tl33, 32J for further 
background. 

In order to define the message-passing updates, we require some further notation. For 
each node u G V, let M{u) := {w \ (w,u) G £} denote its set of neighbors, and let 
£{u) := {(u — > v)\ v G Af(u)} denote the set of all directed edges emanating from u. Fi- 
nally, we define £ := U ug y<f (it), the set of all directed edges in the graph; note that £ has 
cardinality 2\£\. In the BP algorithm, one message m uv G M d is assigned to every directed 
edge (u — > v) G £■ By concatenating all of these (i-vectors, one for each of the 2|£| members 
of £ , we obtain a L>-dimensional vector of messages m = fm„„\, where D := 2|£|<i. 

At each round t = 1, 2, . . ., every node u G V calculates a message m^ 1 G M. d to be sent 
to its neighbor v G M(u). In mathematical terms, this operation can be represented as an 

3 For illustrative purposes, we have assumed here that the distribution F(y \ x) has a product form, but a 
somewhat more involved reduction also applies to a general observation model. 
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update of the form m^ 1 = F uv (m t ) where F uv : M D — > M. d is the local update function of the 
directed edge (u — > v). In more detail, for each x v 6 X , we havd3 



mt uv l ( x v) =[Fuv{m t )\(x v ) = n ^2 i' l Puv{xu,x v )'ilj u {x u ) ] [ , m t wu (x u )\ 



(5) 



«;eA^(u)\{v} 



where k is a normalization constant chosen to ensure that ^2 Xv m t ^ u 1 {x v ) = 1. Figure [2(a) 
provides a graphical representation of the flow of information in this local update. 
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(a) 




Figure 2. Graphical representation of message-passing algorithms, (a) Node u transmits the 
message m uv — F uv (m), derived from equation ([5]), to its neighbor v. (b) Upon receiving all 
the messages, node v updates its marginal estimate according to ©. 

Equation ([5]) is basically an iterative way of solving a set of fixed-point equations in M> D . 
More precisely, by concatenating the local updates (|5|), we obtain a global update function 
F : R D -> R D of the form 



F(m) = {F uv (m)} 



(6) 



D 



Typically, the goal of message-passing is to obtain a fixed point, meaning a vector m* G 
such that F(m*) = m*. For any tree-structured graph, it is known that the update © has 
a unique fixed point. For a general graph (with some mild conditions on the potentials; see 
Yedidia et al. [33] for details), it is known that the global update ([6]) has at least one fixed 
point, but it is no longer unique in general. However, there are various types of contraction 
conditions that can be used to guarantee uniqueness on a general graph (e.g., [29] , IT2 ], [20 ], [23] ) . 

Given a fixed point m*, node v computes its marginal (approximation) t* by combining 
the local potential function ip v with a product of all incoming messages as 



T v (%v) — k ip v {x v ) J~J m uv (x v ), 



(7) 



where k is a normalization constant chosen so that Yl x ex T v( x v) = 1- See Figure [^b) for 
an illustration of this computation. For any tree-structured graph, the quantity r*(x v ) is 
equal to the single- node marginal P(x„), as previously defined dH). For a graph with cycles, 
the vector r* represents an approximation to the single-node marginal, and is known to be a 
useful approximation for many classes of graphical models. 



4 It is worth mentioning that m^ 1 is only a function of the messages m t wu for w £ A/"(u)\{i>}. Therefore, we 
d , where p u is the degree of the node u. Since it is clear from the context and for the 
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have F uv : I 

purpose of reducing the notation overhead, we say m^ 1 = i ? tt „(m t ) instead of m^ 1 — -Fii,;({m4 u } 
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3 Algorithm and Main Results 



We now turn to a description of the SBP algorithm (Section I3.ip . as well as the statement of 
our main theoretical guarantees on its behavior (Section 13. 2p . 

3.1 Stochastic Belief Propagation 

When applied to a pairwise graphical model with random variables taking d states, the number 
of summations and multiplications required by original BP algorithm is 0((i 2 ) per iteration, 
as can be seen by inspection of the message update equation ([5]). This quadratic complexity — 
which is incurred on a per iteration, per edge basis — is prohibitive in many applications, where 
the state dimension may be on the order of thousands. As discussed earlier in Section [H al- 
though certain graphical models have particular structures that can be exploited to reduce 
complexity of the updates, not all problems have such special structures, so that a general- 
purpose approach is of interest. In addition to computational cost, a standard BP message 
update can also be expensive in terms of communication cost, since each update requires 
transmitting (d — 1) real numbers along each edge. For applications that involve power limi- 
tations, such as sensor networks, reducing this communication cost is also of interest. 

Stochastic belief propagation is an adaptively randomized form of the usual BP message 
updates that yields savings in both computational and communication cost. It is motivated 
by a simple observation — namely, that the message-passing update along the directed edge 
(u —> v) can be formulated as an expectation over suitably normalized columns of the com- 
patibility matrix. Here the probability distribution in question depends on the incoming 
messages, and changes from iteration to iteration. This perspective leads naturally to an 
adaptively randomized variant of BP: instead of computing and transmitting the full expec- 
tation at each round — which incurs @(d 2 ) computational cost and requires sending Q(d) real 
numbers — the SBP algorithm simply picks a single normalized column with the appropriate 
(message-dependent) probability, and performs a randomized update. As we show, each such 
operation can be performed in 0{d) time and requires transmitting only logd bits, so that 
the SBP message updates are less costly by an order of magnitude. 

With this intuition in hand, we are now ready for a precise description of the SBP al- 
gorithm. Let us view the edge potential function ip uv as a matrix of numbers ipuv(i,j), for 
i,j = l,...,d. For the directed edge (u — > v), define the collection of column vectors 

r w (:,j) := ^— ttt , for j = 1,2,..., d, (8) 

Puv (J ) 

where /3 uv (j) := Xrf=i rfuvih j) At(i)- We assume that the column vectors T uv (:,j) and 
normalization constants /3 uv (j) have been pre-computed and stored, which can be done in 
an off-line manner. In addition, the algorithm makes use of a positive sequence of step sizes 
{A*}^ . In terms of these quantities, the SBP algorithm consists of the steps shown in 
Figure El 

The per iteration computational complexity of the SBP algorithm lies in calculating the 
probability mass function p uv , defined in equation (fTUj) : generating a random index J uv ac- 
cording to the mass function (jlOp . and performing the weighted update (jlip . Denoting the 
maximum degree of the graph by p max , we require at most (p max — l)d multiplications to 
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Stochastic Belief Propagation Algorithm: 




(I) Initialize the message vector m° G M. D . 




(II) For iterations t = 0, 1, 2, 3, . . ., and for each directed edge (u — > 


u) G £: 


(a) Compute the product of incoming messages: 




KvH) = II m tw(») foriG{l,. 


-,4- (9) 


i»e^(u)\{ti} 




(b) Pick a random index J^ 1 G {1, 2, . . . , tf} according to the probability distribu- 


tion 




pLU) « MhU) Puv(j) for j e {1, • • • 


d}. (10) 


(c) For a given step size A* G (0, 1), update the message m*+ x 


G K d via 


mit 1 = (l-A«)mL + AT^:,./** 1 


)■ (11) 



Figure 3: Specification of stochastic belief propagation. 



compute M uv . Moreover, an additional 3d operations are needed to compute the probability 
mass function p uv . On the other hand, generating a random index J uv , can be done with 
less than d operations by picking a number U uniformly at random from [0, 1] and setting^] 
J uv := inf jj : Yli=iPuv(i) > U}. Finally the update (fTTj) needs 3d + 3 operations. Adding 
up these contributions, we find that the SBP algorithm requires at most (p m ax + 6)(i + 3 multi- 
plications and/or summations per iteration per edge to update the messages. As can be seen 
from equation ([5]), the regular BP complexity is 0((i 2 ). Therefore, for graphs with bounded 
degree (of most interest in practical applications), the SBP message updates have reduced 
the per iteration computational complexity by a factor of d. In addition to computational 
efficiency, SBP provides us with a significant gain in message/communication complexity over 
BP. This can be observed from the fact that the normalized compatibility matrix T uv is only 
a function of edge potentials ip uv , hence known to the node v. Therefore, node u has to 
transmit the random column index J uv to node v, which can be done with only logd bits. 
This is a significant gain over BP that requires transmitting a (d — l)-dimensional vector of 
real numbers per edge at every round. Here we summarize the features of our algorithm that 
make it appealing for practical purposes. 

• Computational complexity: SBP reduces the per iteration complexity by an order of 
magnitude from @(d 2 ) to Q(d). 

• Communication complexity: SBP requires transmitting only logd bits per edge in con- 
trast to transmitting a {d — l)-dimensional vector of real numbers in the case of BP. 

The remainder of the paper is devoted to understanding when, and if so, how quickly the 
SBP message updates converge to a BP fixed point. Let us provide some intuition as to why 

5 It is known that for any distribution function G(-), the random variable G -1 (C/) has the distribution G(-). 
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such a behavior might be expected. Recall that the update (jlip is random, depending on the 
choice of index J chosen in step 11(b). Suppose that we take expectations of the update (llip 
only over the distribution (|10p . in effect conditioning on all past randomness in the algorithm. 
(We make this idea precise via the notion of a- fields in our analysis.) Doing so yields that 
the the expectation of the update (fTT|) is given by 

d 

EK 1 I mt v ] = (1 - A') mi, + A* r w (:,j). 

i=i 

Recalling the definitions ([8]) and (jlOp of the matrix Y and mass function p, respectively 
and performing some algebra, we see that, in an average sense, the SBP message update is 
equivalent to (a damped version of the) usual BP message update. The technical difficulties 
lie in showing that despite the fluctuations around this average behavior, the SBP updates 
still converge to the BP fixed point when the stepsize or damping parameter A* is suitably 
chosen. We now turn to precisely this task. 

3.2 Main Theoretical Results 

Thus far, we have proposed a stochastic variant of the usual belief propagation (BP) algorithm. 
In contrast to the usual deterministic updates, this algorithm generates a random sequence 
{m*}£2. of message vectors. This randomness raises two natural questions: 

• Is the SBP algorithm strongly consistent? More precisely, assuming that the ordinary 
BP algorithm has a unique fixed point m*, under what conditions do we have m —> m* 
almost surely as t — > oo? 

• When convergence occurs, how fast does it take place? The computational complexity 
per iteration is significantly reduced, but what are the trade-offs incurred by the number 
of iterations required? 

The goal of this section is to provide some precise answers to these questions, ones which 
show that under certain conditions, there are provable gains to be achieved by the SBP 
algorithm. We begin with the case of trees, for which the ordinary BP message updates 
are known to have a unique fixed point for any choice of potential functions. For any tree- 
structured problem, the upcoming Theorem Q] guarantees that the SBP message updates are 
strongly consistent, and moreover tlicit in terms of tlie elementwise £qq norm they converge in 
expectation at least as quickly as 0(l/yt), where t is the number of iterations. We then turn 
to the case of general graphs. Although the BP fixed point need not be unique in general, 
a number of contractivity conditions that guarantee uniqueness and convergence of ordinary 
BP have been developed (e.g., [29j HI 1201 [23]). Working under such conditions, we show in 
Theorem [2] that the SBP algorithm is strongly consistent, and we show that the mesn square 
error decays at least as quickly as 0(1/ 1). In addition, we provide high probability bounds 
on the error at each iteration, showing that the typical performance is highly concentrated 
around its average. Finally, in Section 13,2.31 we provide a new set of sufficient conditions 
for contractivity in terms of node/edge potentials and the graph structure. As we discuss, 
our theoretical analysis shows not only that SBP is provably correct, but also that in various 
regimes, substantial gains in overall computational complexity can be obtained relative to the 
ordinary BP. 
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3.2.1 Guarantees for Tree-structured Graphs 

We begin with the case of a tree-structured graph, meaning a graph Q that contains no 
cycles. As a special case, the Markov chain shown in Figure [T^b) is an instance of such a tree- 
structured graph. Recall that for some integer r > 1, a square matrix A is said to be nilpotent 
of degree r if A r = 0. (We refer the reader to Horn and Johnson [11] for further background on 
nilpotent matrices and their properties.) Also recall the definition of the diameter of a graph 
G, denoted by diam(^), as the length (number of edges) of the longest path between any pair 
of nodes in the graph. For a tree, this diameter can be at most n — 1, a bound achieved by 
the chain graph. In stating Theorem [IJ we make use of the following definition: for vectors 
x,y G M. D , we write x ^ y if and only if x{%) < y(i) for all i = 1,2,... ,D. Moreover, for 
an arbitrary x S R D , let \x\ denote the vector obtained from taking the absolute value of its 
elements. With this notation in hand, we are now ready to state our first result. 

Theorem 1 (Tree-structured graphs). For any tree-structured Markov random field, the se- 
quence of messages {m'}^ generated by the SBP algorithm with step size A* = \/{t-\-l), has 
the following properties: 

(a) The message sequence {m*}^ converges almost surely to the unique BP fixed point m* 
as t — > oo. 

(b) There exist a nilpotent matrix A 6 M. DxD of degree at most r = diam(C?) such that the 
D-dimensional error vector m* — m* satisfies the elementwise inequality 

E[|m* - m*\] < 4 (I - 2A)' 1 — for all iterations t = 1,2, .. .. (12) 

yt 

Remarks: The proof of this result is given in Section 14. 11 Part (a) shows that the SBP 
algorithm is guaranteed to converge almost surely to the unique BP fixed point, regardless of 
the choice of node/edge potentials and the initial message vector. Part (b) refines this claim 
by providing a quantitative upper bound on the rate of convergence: in expectation, the 
norm of the error vector is guaranteed to decay at the rate 0{l/y/t). As noted by a helpful 
reviewer, the upper bound in part (b) is likely to be conservative at times, since the inverse 
matrix (1 — 2A) _1 may have elements that grow exponentially in the graph diameter r. As 
shown by our experimental results, the theory is overly conservative in this way, as SBP still 
behaves well on trees with large diameters (such as chain). Indeed, in the following section, 
we provide results for general graphs under contractive conditions that are less conservative. 

3.2.2 Guarantees for General Graphs 

Our next theorem addresses the case of general graphs. In contrast to the case of tree- 
structured graphs, depending on the choice of potential functions, the BP message updates 
may have multiple fixed points, and need not converge in general. A sufficient condition for 
both uniqueness and convergence of the ordinary BP message updates, which we assume in 
our analysis of SBP, is that the update function F, defined in ([6j), is contractive. In particular, 
it suffices that there exist some < \x < 2 such that 

\\F(m) - F(m')\\ 2 < (l - |) \\m - m'\\ 2 . (13) 

Past work has established contractivity conditions of this form when the BP updates are 
formulated in terms of log messages [29j [121 EOj [23] . In Section [3231 we use related techniques 
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to establish sufficient conditions for contractivity for the BP message update F that involves 
the messages (as opposed to log messages). 

Recalling the normalized compatibility matrix with columns ^uvi},]) '■= ^«»0> j)^u(j) / fiuvij)) 
we define its minimum and maximum values per row as follows o 

Rl v {i) := mmT uv (i,j) > 0, and B° uv {i) := maxT uv (i,j) < 1. (14) 
The pre-factor in our bounds involves the constant 

KW--4,— — o TTTs ■ (15) 

With this notation, we have the following result: 

Theorem 2 (General graphs). Suppose that the BP update function F : M. D — > H D satisfies 
the contraction condition (1131). 



(a) Then BP has a unique fixed point m* , and the SBP message sequence {m t } ( j^ , generated 
with the step size A* = 0{l/t), converges almost surely to m* as t — >■ oo. 

(b) With the step size A* = a/(fi (t + 2)) for some fixed 1 < a < 2, we have 

E[|K-m*g < 3* if (V>) a 2 f l\ \\m°-m*\\l _ 
||m*||| ~ 2 a /x 2 (a-l) \t) \\m*\\l * ' f 



for all iterations t = 1,2,.... 
(c) With the step size A* = l/(fi(t + 1)), we have 



E[||m*-m*||j] < K(V0 / 1 + logt y (|7) 



ll m II2 

also for every < e < 1 and t>2, we have 

||m*-m*||| K{i>) ( 8\/l + logi 



12 

with probability at least 1 — e 



*w- £ i^v + 7t- {—^> (18) 



Remarks: The proof of Theorem [2] is given in Section 14.21 Here we discuss some of the 
various guarantees that it provides. First, part (a) of the theorem shows that the SBP algo- 
rithm is strongly consistent, in that it converges almost surely to the unique BP fixed point. 
This claim is analogous to the almost sure convergence established in Theorem [2(a) for trees. 
Second, the bound (fTBj) in Theorem 0(b) provides a non-asymptotic bound on the normalized 
mean-squared error E[||m* — w*||2]/|| m *|||]- F° r the specified choice of step-size (1 < a < 2), 
the first component of the bound (|16p is dominant, hence the expected error (in squared 
^2-norm) is of the ordei0 1/t. Therefore, after t = 0(l/<5) iterations, the SBP algorithm 

6 As will be discussed later, we can obtain a sequence of more refined (tighter) lower {B^„(i)}£oi an d upper 
{B e uv (i)}^l bounds by confining the space of feasible messages. 

7 At least superficially, this rate might appear faster than the 1/y/t rate established for trees in Theorem[]Jb); 
however, the reader should be careful to note that Theorem [1] involves the elementwise ^oo-norm, which is not 
squared, as opposed to the squared ^2-norm studied in Theorem [21 
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returns a solution with MSE at most 0(6). Finally, part (c) provides bounds, both in expec- 
tation and with high probability, for a slightly different step size choice. On one hand, the 
bound in expectation (|17p is of the order 0(logt/t), and so includes an additional logarithmic 
factor not present in the bounds from part (b). However, as shown in the high probability 
bound (fl8|) . the squared error is also guaranteed to satisfy a sample-wise version of the same 
bound with high probability. This theoretical claim is consistent with our later experimental 
results, showing that the error exhibits tight concentration around its expected behavior. 

Let us now compare the guarantees of SBP to those of BP. Under the contraction condition 
of Theorem EJ the ordinary BP message updates are guaranteed to converge geometrically 
quickly, meaning that 0(log(l/<5)) iterations are sufficient to obtain (5-accurate solution. In 
contrast, under the same conditions, the SBP algorithm requires Q(l/5) iterations to return 
a solution with MSE at most 6, so that its iteration complexity is larger. However, as noted 
earlier, the BP message updates require @(d 2 ) operations for each edge and iteration, whereas 
the SBP message updates require only Q(d) operations. Putting the pieces together, we 
conclude that: 

• on one hand, ordinary BP requires ©(|£| d 2 log(l/<5)) operations to compute the fixed 
point to 5-accuracy; 

• in comparison, SBP requires 0(|£ | d (l/<5)) operations to compute the fixed point to 
expected accuracy 5. 

Consequently, we see that as long the desired tolerance is not too small — in particular, if 
6 > 1/d — then SBP leads to computational savings. In many practical applications, the state 
dimension is on the order of 10 3 to 10 5 , so that the precision 5 can be of the order 10 -3 to 
1CP 5 before the complexity of SBP becomes of comparable order to that of BP. Given that 
most graphical models represent approximations to reality, it is likely that larger tolerances 
5 are often of interest. 



3.2.3 Sufficient Conditions for Contractivity 

Theorem [2] is based on the assumption that the update function is contractive, meaning 
that its Lipschitz constant L is less than one. In past work, various authors have developed 
contractivity conditions, based on analyzing the log messages, that guarantee uniqueness and 
convergence of ordinary BP (e.g., [29^ [T2"l [201 123j). Our theorem requires contractivity on 
the messages (as opposed to log messages), which requires a related but slightly different 
argument. In this section, we show how to control L and thereby provide sufficient conditions 
for Theorem [2] to be applicable. 

Our contractivity result applies when the messages under consideration belong to a set of 
the form 

S := lm£R D \ Y,m U v(i) = 1, R uv (i) <m uv (i) <B uv (i) V(« -)> v) € £ , Vi € x\, (19) 
^ iex ' 

for some choice of the upper and lower bounds — namely, B uv (i) and B_ uv (i) respectively. For 
instance, for all iterations t = 0,1,..., the messages always belong to a set of this forrrJl 

8 It turns out that the BP update function on the directed edge (it — > v) is a convex combination of normalized 
columns F uv (:,j) for j = 1, . . . , d. Therefore, we have B_^ v (i) < m uv (i) < B uv (i), for all i = 1, . . . , d. 



12 



with B_ uv (i) = B® uv {i) and B uv {i) = Ep uv {i), as previously defined CEU). Since the bounds 
(B^ v (i), B° uv (i)) do not involve the node potentials, one suspects that they might be tightened 
at subsequent iterations, and indeed, there is a progressive refinement of upper and lower 
bounds of this form. Indeed, assuming that the messages belong to a set S at an initial 
iteration, then for any subsequent iterations, we are guaranteed the inclusion 

m G F(S) := {F{m') G R D | m'e5}, (20) 

which then leads to the refined upper and lower bounds 



pi ,-, ■ IX^T f -\ Puv(j) M uv (j) , 

i=i 



7TtG5 



E?=i/M*)M U „(£) 



■si /.n f ST- r c -\ Puv(j) M uv(j) 

B uvW ■= sup <^ > r^ y — - — 



where we recall the quantity M uv = Y\ we j\f( u )\{ v } m wu previously defined Q. While such 
refinements are possible, in order to streamline our presentation, we focus primarily on the 
zero'th order bounds B_ uv (i) = B^ v {i), and B uv (i) = B uv (i). 

Given a set S of the form (fT9j) . we associate with the directed edge (u — > v) and (w — > u) 
(where w G M(u)\{v}) the non-negative numbers 



where 



®l(u,v) := ^2 {(f>uv,wu((f>uv,wu + Xuv,wu)) 2 , and (21a) 

weN(u)\{v} 

_ ^ i 
® 2 (w,u) := 2^ {4>uv,wu {<t>uv,wu + Xuv,wu)) 2 , (21b) 

v£Af(u)\{w} 



max sup / — f uv ^ — wv ^ tttI) and (22a) 

jex meS I J2t=i Puv{k) M uv (k) m wu (j) J 



Xw.toit := max sup 



Puv (j) M u „ (j) A (j) (j) 1 

ie* me s l(ELiM*0 M ™(*0) 2 m ^ ^ 

Recall the normalized compatibility matrix r ut; G M dxrf on the directed edge (u — > v), 
as previously defined in equation ([8]). Since has positive entries, the Perron-Frobenius 
theorem [Tlj guarantees that the maximal eigenvalue is equal to one, and is associated with 
a pair of left and right eigenvectors (unique up to scaling) with positive entries. Since T^ v 
is row-stochastic, any multiple of the all-one vector 1 can be chosen as the right eigenvector. 
Letting z uv G M. d denote the left eigenvector with positive entries, we are guaranteed that 
f 7 \ uv > 0, and hence we may define the matrix T^ v — lz T v / (V" z uv ). By construction, this 
matrix has all of its eigenvalues strictly less than 1 in absolute value (Lemma 8.2.7, |llj). 



Proposition 1. The global update function F : MP — > M> D defined in equation ([6]) is Lipschitz 
with constant at most 



7 1 

-C u r -L 



L := 2 max ^||r u „ — ^- 1||2 max _$>\ (u,v) max ^2(w,u), (23) 

(u— >v)£E 1 Z uv (u^fv)££ (w>— >u)e£ 

where ||| • I2 denotes the maximum singular value of a matrix. 
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In order to provide some intuition for Proposition [U let us consider a simple but illuminating 
example. 

Example 1 (Potts model) . The Potts model [9j [28| [T6] is often used for denoising, segmen- 
tation, and stereo computation in image processing and computer vision. It is a pairwise 
Markov random field that is based on edge potentials of the form 



1 if i = j, and 
7 if * ¥= j- 



for all edges (u, v) G £ and i,j £ {1,2,... ,d}. The parameter 7 6 (0,1] can be tuned to 
enforce different degrees of smoothness: at one extreme, setting 7 = 1 enforces no smoothness, 
whereas a choice close to zero enforces a very strong type of smoothness. (To be clear, the 
special structure of the Potts model can be exploited to compute the BP message updates 
quickly; our motivation in considering it here is only to provide a simple illustration of our 
contractivity condition.) 

For the Potts model, we have /3 uv (j) = ip u (j) (1 + (d— 1)7), and hence F uv is a symmetric 
matrix with 



l+(d-l) 7 * - 3 
l+(d-l)7 lf 1 ^ J - 



Some straightforward algebra shows that the second largest singular value of T uv is given by 
(1 - 7)/(l + (d- 1)7), whence 

lllr z uv^ m 1-7 

IXLcLX HI J. yy 



l T z uv l + (d-l)7" 

The next step is to find upper bounds on the terms $i(u,v) and #2(w,m), in particular 
by upper bounding the quantities 4> U v,wu and Xuv,wu, as defined in equations (|22ap and (|22bp 
respectively. In Appendix [Aj we show that the Lipschitz function of F uv is upper bounded as 

L<4(l- 7 )(l + (d-l) 7 ) max ( (p " ~ 1)2 max/ 



where p u is the degree of node u. Therefore, a sufficient condition for contractivity in the case 
of the Potts model is 



m »< nagf _W?L- U < I w - ] . (24) 



»6V I 7"" i&r lEiLA-MJ J \^ (1-7) (1 + (<I -1)7) 

To gain intuition, consider the special case in which the node potentials are uniform, so 
that ipu(j) / (J2e=i ipu(£)) = 1/d. In this case, for any graph with bounded node degrees, 
the bound (|24|) guarantees contraction for all 7 in an interval [e, 1]. For non-uniform node 
potentials, the inequality (f24"j) is weaker, but it can be improved via the refined sets ([2U|) 
discussed previously. 
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4 Proofs 



We now turn to the proofs of our two main results, namely Theorems [U and [21 as well as the 
auxiliary result, Proposition [JJ on contractivity of the BP message updates. For our purposes, 
it is convenient to note that the ordinary BP update can be written as an expectation of the 
form 

F w (m*) = E 4 +i t [T uv (:, J** 1 )] , (25) 

" UV fill) 

for all t = 0, 1, Here the index J^ 1 is chosen randomly according to the probability mass 

function (fTUD . 

4.1 Proof of Theorem [TJ 

We begin by stating a lemma that plays a central role in the proof of Theorem [TJ 

Lemma 1. For any tree- structured Markov random field, there exist a nilpotent matrix 
A £ M. DxD of degree at most r = diam(<5) such that 

\F(m)-F{m')\ < A\m-m'\, (26) 

for all m,m'£iS. 

The proof of this lemma is somewhat technical, so that we defer it to Appendix [Bj In inter- 
preting this result, the reader should recall that for vectors x,y £ M 13 , the notation x < y 
denotes inequality in an elementwise sense — i.e., x(i) < y(i) for i = 1,. . . ,D. 

An immediate corollary of this lemma is the existence and uniqueness of the BP fixed 
point. Since we may iterate inequality (|26f) . we find that 

\F^(m)-F^(m')\ X A e \m-m'\, 

for all iterations £ = 1,2,..., and arbitrary messages m, vn! , where F^> denotes the com- 
position of F with itself £ times. The nilpotence of A ensures that A r = 0, and hence 
F^(m) = F( r )(m') for all messages m, and m! . Let m* = F^ r \m) denote the common value. 
The claim is that m* is the unique fixed point of the BP update function F. This can be 
shown as follows: from Lemma Q] we have 

\F(m*) - m*\ = \F {r+1 \m) - F^{m)\ ■< A\F^(m) - F^ r_1 ^ (m)|. 

Iterating the last inequality for the total of r times, we obtain 

\F(m*) - m*\ < A r \F{m) - m\ = 0, 

and hence F{m*) = m* . On the other hand, the uniqueness of the BP fixed point is a direct 
consequence of the facts that for any fixed point m* we have F^ r \m*) = m* , and for all 
arbitrary messages m, m' we have F^ r \m) = F"(m'). Accordingly, we see that Lemma [1] 
provides an alternative proof of the well-known fact that BP converges to a unique fixed point 
on trees after at most r = diam(^) iterations. 

We now show how Lemma Q] can be used to establish the two claims of Theorem [TJ 
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4.1.1 Part (a): Almost Sure Consistency 

We begin with the almost sure consistency claim of part (a). By combining all the local 
updates, we form the global update rule 

m t+l = (1 - A*) m* + A* v t+l for iterations t = 0, 1, 2, . . ., (27) 

where v t+1 := {T uv (:, J^ 1 )} r u -*v)e£ 1S ^ e ^-dimensional vector obtained from stacking up 
all the normalized columns T uv (:, J^ 1 )- Defining the vector Y t+1 := v t+1 — F(m t ) G R D , we 
can rewrite the update (p7|) as 

m t+1 = (1 - A') to* + A* F(m*) + A* y m for t = 0, 1, 2, ... . (28) 

With our step size choice A* = l/(t + 1), unwrapping the recursion (|28|) yields the represen- 
tation 

m' = j f>(m') + 7 

Subtracting the unique fixed point m* from both sides then leads to 

t-i t 
m'-m* = - ^( F ^ m£ ) ~ F ( m *)) + J J2 Y * + J ( F ( m °) ~ F i m *))> ( 29 ) 



where we have introduced the convenient shorthand Z l . We may apply triangle inequality to 
each element of this vector equation; doing so and using Lemma [U to upper bound the terms 
\F(rn ) — F(m*)\, we obtain the element-wise inequality 

1 £=i 

|m* - m*\ < - ^2 A \m l - m*\ + \Z l \ for t = 1,2,.... 

t=\ 

Since A T is the all-zero matrix, unwrapping the last inequality r = diam(^) times yields the 
element-wise upper bound 

|m* -m*\< G* + AG{ + A 2 G\ + ■■■ + A r ~ 1 G t r _ 1 , (30) 
where the terms G„ are defined via the recursion 

G\ ■= I Ei=i G e-i for £ = 1, . . . , r - 1, with 

initial conditions Gq := |Z |. 

It remains to control the sequences {G t i } ( ^ =1 for £ = 0, 1, . . . ,r — 1. In order to do so, 
we first establish a martingale difference property for the variables Y l defined prior to equa- 
tion (l28j) . For each t = 0, 1,2, . . ., define the cr-field J 7 * := a(mP,m l , . . . , m 4 ), as generated 
by the randomness in the messages up to time t. Based on the representation (j25j) . we see 
that E[y* +1 |J rt ] = 0, showing that {1"* +1 }^ forms martingale difference sequence with re- 
spect to the filtration {J-^j^Q. From the definition, it can be seen that the entries of Y t+l 
are bounded; more precisely, we have |Y t+1 («)| < 1 for all iterations t = 0,1,2,..., and all 
states i = 1, 2, . . . D. Consequently, the sequence {Y^ c ^ =1 is a bounded martingale difference 
sequence. 
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We begin with the term G . Since Y is a bounded martingale difference, standard conver- 
gence results [8] guarantee that | Yle=i Y e \/t — > almost surely. Moreover, we have the bound 
\F(mP) — F{m*)\/t ■< 1/t. Recalling the definition of Z l from equation (j29|) . we conclude that 
Gq = \Z t \ converges to the all-zero vector almost surely as t — > 00. In order to extend our 
argument to the terms G\ for I = l,...,r — 1, we make use of the following fact: for any 
sequence of real numbers {x*}^ such that x l — > 0, then we also have (X^=o x ^)A ~~ * (e-g-, 



see Royden 



Consequently, for any realization oj such that the deterministic sequence 



{Go(w)}^ converges to zero, we are also guaranteed that the sequence {G\(uj)}^ , with 
elements G\(co) = {Y!j=i G J (oj))/t, converges to zero. Since we have shown that Gq ^> 0, we 
conclude that G\ ^4' as well. This argument can be iterated, thereby establishing almost 
sure convergence for all of the terms G\. Putting the pieces together, we conclude that the 
m*\ converges almost surely to the all-zero vector as t — > 00, thereby completing 



vector \m 



the proof of part (a). 



4.1.2 Part (b): Bounds on Expected Absolute Error 

We now turn to part (b) of Theorem [U which provides upper bounds on the expected absolute 
error. We establish this claim by exploiting some martingale concentration inequalities [5]. 
From part (a), we know that {Y t } ( j^L 1 is a bounded martingale difference sequence, in particular 
with < 1. Applying the Azuma-Hoeffding inequality [5] yields the tail bound 



for all 7 > 0, and i = 1,2, ... ,D. By integrating this tail bound, we can upper bound the 
mean: in particular, we have 



E 



L 1=1 
and hence 

E[G* ] = E[|Z*|] 
Turning to the term G\, we have 



f p(i|E^»l>7)*r< ^ 



-< 



2vr -. 1 A 
1 + - -< -pi. 

t t ~ y/t 



(31) 



(i) 1^4 

-< 



* - (») 2 • 4 



1, 



where step (i) uses the inequality (j3T|) . and step (ii) is based on the elementary upper bound 
StiVv^ <! + /]* l/y/xdx < 1\ft. By repeating this same argument in a recursive 
manner, we conclude that E [G\] < (2 £ ■ A/y/i) 1 for £ = 2, 3, . . . , r- 1. Taking the expectation 
on both sides of the the inequality ([30]) and substituting these upper bounds, we obtain 

Epm'-m*!] ± 4(xW)i= = 4(7 -2A)" 1 



where we have used the fact that A T 



0. 
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4.2 Proof of Theorem H 

We now turn to the proof of Theorem [2j Note that since the update function is contractive, 
the existence and uniqueness of the BP fixed point is an immediate consequence of the Banach 
fixed-point theorem [Tj. 

4.2.1 Part (a): Almost Sure Consistency 

We establish part (a) by applying the Robbins- Monro theorem, a classical result from stochas- 
tic approximation theory (e.g., [221 S] ) - In order to do so, we begin by writing the update (fTTj) 
in the form 

m t+i _ t _\t( m t _ r /. rf+hi 
'"'uv ~ '"'uv A \" L uv L uv\-i J uv l]i 
s , ' 

where for any realization J uv G {1,2, ... ,d}, the mapping m uv h-> H uv (m uv , J uv ) should be 
understood as a function from W d to W 1 . By concatenating together all of these mappings, 
one for each directed edge (u — > v), we obtain a family of mappings H(-,J) from R D to R D , 
one for each realization J G {1, 2, . . . , d} 2 '^' of column indices. 

With this notation, we can write the message update of the SBP algorithm in the compact 
form 

m t+1 = m* — A* H(m\ J m ), valid for for t = 1, 2, . . ., (32) 

suitable for application of the Robbins-Monro theorem]! In order to apply this result, we 
need to verify its hypotheses. First of all, it is easy to see that we have a bound of the form 

E[\\H(m,J)f 2 ] < c(l + \\m\\ 2 2 ), 

for some constant c. Moreover, the conditional distribution of the vector J* , given the past, 
depends only on m*; more precisely we have 

^{j t+1 \J t ,J t - 1 ,...,m\m t - 1 ,...) = P(j* +1 |m*). 

Lastly, defining the averaged function h(m) := E[ff(m, J)|m] = m — F(m), the final require- 
ment is to verify that the fixed point m* satisfies the stability condition 

inf (m — m*, h(m)) > 0, (33) 

mGiS\{m* } 

where (•, •) denotes the Euclidean inner product, and S denotes the compact set in which 
the messages lie. Using the Cauchy-Schwartz inequality and the fact that F is Lipschitz with 



9 The theorem states that if the vector field function H(m,-) has a bounded second moment — that is 
E [||JJ(m, «7)||!1 < c (l + Il m ll2) for some constant c, the conditional distribution of the random vector J t+1 
knowing the past depends only on m* — that is P(j' +1 j J*, J t_1 , • ■ • , m 1 , m* -1 , ■ • ■ ) = P(J' +1 jm'), denoting 
the expected vector field function h(m) := ~E[H(m, J)|m] , there exist a vector m* such that 

inf (m — m*, h(m)) > 0, 

m£S\{m* } 

and finally the step sizes satisfy the conditions 53t=o ^ = °°> anc ^ X^o^') 2 < °°> then the sequence {m'}™ 
converges almost surely to m*. 
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constant L = 1 — fJ,/2, we obtain 

(to — to*, 7t(m) — h(m*)) = \\m — m*\\\ — (to — m*, F(m) — F(m*)) 

> | \\m-m*\\l > 0, (34) 

where the strict inequality holds for all to ^ to*. Since m* is a fixed point, we must have 
h(m*) = to* — F(m*) = 0, which concludes the proof. 

4.2.2 Part (b): Non-asymptotic Bounds on Mean- squared Error 

Let e* := (to* — TO*)/||m*||2 denote the re-normalized error vector. In order to upper bound 
I£[Jl e *ll!] for all t = 1,2, . . ., we first control the quantity ||e t+1 ||2 — 1 1 e* 1 1 § ^ corresponding to 
the increment in the squared error. Doing some simple algebra yields 

l|e i+1 |||-|ffi = * (llm^-TOllHIm'-mll) 
W 171 lb 

' (m t+1 - to*, m t+1 + to* - 2m*). 



I 777 -*!!! 



Recalling the update equation (f3"2"j) . we obtain 

|e* +1 ||2 " l|e*||| = ~\\~Tii2 (~^ t H(m t , J <+1 ), — X t H(m t , J* +1 ) + 2(to* — to*)) 

|if(m*, J t+1 )i - — ^ (Him*, J t+1 ), m<-m*>. (35) 



h- 1 


ra* 


totoj 


(A*) 


to 


m* 


totoj 



I™ 112 

Now taking the expectation from both sides of the equation (f35|) yields 

E[||e m ||i]-E[||e*||i] = j% E[\\H(m* t J t+1 )g] - ^ E[E[(H(m*, J m ), m* - m*)|J*]] 

1 1 1 1 2 II 1 1 2 

= ^>E[\\H(m\j t+1 )\\ 2 2 ] - -^E^mV/iKj.m'-m*}], 
ll m II2 \\ m II2 

(36) 

where we used the facts that E[i?(m*, J <+1 )| F 1 ] = him 1 ) and h(m*) = 0. We continue 
by upper bounding the term G\ = ||i7(m*, ^ <+1 )||2/ll m *ll2 an d lower bounding the term 
G2 = (him 1 ) — h(m*), m t — m*) /\\m*\\l. 

Lower bound on G2' Recalling (I34D from our proof of part (a), we see that 

G 2 >f||e*||I. (37) 

Upper bound on G\i Prom the definition of the update function, we have 

\\H(m\j^)f 2 = £ ||<,-r w (:,J*J||! <2 ^ (KJ 2 . + ||r w (:, J* B )||jj). 

(u—>-v)e£ 
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Recalling the bounds ([H]) and using the fact that vectors m*„ and T uv (:, J^ v ) sum to one, we 
obtain 

\\H(m\j t+1 )\\l <2 (^^(0) (ll"4lli + l|r w (:,4«)lli) 

On the other hand, we also have 

||m*||| > Y, {™%^uM\\<vh = E (mm^(i)). 
(u— >v)e£ (u— >v)e£ 
Combining the pieces, we conclude that the term G\ is upper bounded as 

G\ < K&) := 4 ^ ( — )gg . " J].! . (38) 

Since both G\ and G2 are non-negative, the bounds (|38l) and (l3?|) also hold in expec- 
tation. Combining these bounds with the representation (|36p . we obtain the upper bound 
E[||e* +1 ||2] -E[||e*||§] < K(i/>) (A*) 2 - A*/iE[||e*|||], or equivalent^ 

n\\e t+1 \\l] < ^W(A') 2 + (l-AV)E[||e*|||]. 
Setting A* = a/(fj,(t + 2)) and unwrapping this recursion yields 

E „ s ^E(?nH)) + nH) ^ 

where we have adopted the convention that the inside product is equal to one for i = t + 2. 
The following lemma, proved in Appendix O provides a useful upper bound on the products 
arising in this expression: 

Lemma 2. For all i G {1,2, ... ,t + 1}, we /icwe 

t+2 

m 

£=i+l 

Substituting this upper bound into the inequality ([39]) yields 

111 " 2J - /x 2 (t + 3) a ^ i 2 v* + 3y 



a\ (i + l 
t + 3 



kmc? ,y ^ j_ / Erilc o|| 2l 

- ^ 2 (i + 3)^2 j ^ i 2 - + Vt + sJ ^ [l|e ll2j - 



It remains to upper bound the term ^i=2 V* 2 a - Since the function 1/x 2 a is decreasing in 
x for a < 2, we have the integral upper bound YtiXl l/i 2 ~ a < fl +2 l/x 2_a dx, which yields 



E[|| e * +1 |||] < { 



r (l) Q ^(^F + (mTn\\e%] if0<a<l 
3^)10^ + ^e^oiH] ifa = 1 

l(f) a ^^+ ifKa<2 



If we now focus on the range of a £ (1,2), which yields the fastest convergence rate, some 
simple algebra yields the form of the claim given in the theorem statement. 
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4.2.3 High Probability Bounds 

Recall the algebra in the beginning of the Section 14.2.21 Subtracting the conditional mean of 
the second term of the equation ([35]) yields 



\\m H2 ll m II2 

where we have denoted the term 

him*) - H(m t ,J t+1 ) 



m* 2 



Recalling the bounds on G\ = J* +1 )||2 / II 771 *!!! an d G2 = (/i(m*), m* — m" 

from part (b), we have 



m*||| 



l|e m ||i - 1 1 1 1 § < tfty) (A*) 2 - fiX'We'g + 2A* (Y t+1 , e 4 ), 

or equivalently 

\\e t+1 \\ 2 2 < i^(V)(A*) 2 + (l-/xA*)||e*||| + 2A* (Y* +1 , e*). 
Substituting the step size choice A* = l/(^(t + 1)) and then unwrapping this recursion yields 

I, t+l||2 < Kjj,) ^1 2 y /yT+l .TV 

A 1 t + 1 /X {t + 1) ^ 

Note that by construction, the sequence {Y T }^ =1 is a martingale difference sequence with 
respect to the filtration T T = a(m°, m 1 , . . . , m T ) that is E[V r+1 | J 71 "] = and accord- 
ingly E[y r+1 ] = for r = 0,1,2,.... We continue by controlling the stochastic term 
(St=o(^ r+1 ' eT ))/(t + 1) — namely its variance, 

( 1 ,* 

var 



r=0 ■ 1 - L 



T = 

1 

- V E[(T T+1 e Tx21 



Ti 

+ ^ E[(F-+ 1 ,e-)(y- +1 ,e^)]. 

^ ' 0<T 2 <Tl<t 
V v ' 

T 2 



Since we have 



E[(F T1+1 , e T1 )(F T2+1 , e T2 )} = E[E[(y T1+1 , e T1 ){Y T2+1 , e T2 ) \ F T1 }] 

= E[(Y T2+1 , e T2 ) E[(y T1+1 , e T1 ) | J^ 1 ]] = 0, 
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for all T\ > T2, the cross product term T2 vanishes. On the other hand, the martingale 
difference sequence is bounded. This can be shown as follows: from part (b) we know 
\\H(m T , J r+1 )||2/||"i*||2 < v K(ip); also using the fact that 1 1 - 1 1 3 is convex, Jensen's inequality 
yields ||/i(m T )||2/||m*||2 < \J K(i[)); therefore, we have 

1 1 77T.* 1 1 2 ll m *l|2 

Moving on to the first term T\, we exploit the Cauchy Schwartz inequality in conjunction 
with the fact that the martingale difference sequence is bounded to obtain 

E[(Y-+ 1 ,e0 2 ] < m\Y T+1 \\ 2 2\\e T \\l] < * K E[\\e T g] . 

Taking the expectation from both sides of the inequality (j40l) yields IE [ 1 1 e" 7- 1 1 2] < (K(ip)/^i 2 ) (1 + log t)/t; 
and hence we have 

AK(7p) 2 1 + logr 



for all r > 1. Moreover, since 



l m °l|2 ^ 



1Mb ~ yj2 (u ^ v)e£ - {mh MeX B° v (i)) 

the initial term E[(Y\ e ) 2 ] < 4^(-^) E [ 1 1 e° 1 1 1] is upper bounded by 4K(tjj) 2 . Finally, putting 
all the pieces together, we obtain 

r^l + logr 4 TOO 5 
varl- -> (r T1 ,e')] < „,, v '' S — + 



+ 1 V ' e V -^ 2 (t + i) 2 h 



0) 4 TO) 2 (l + log(t + l)) 2 + 4 
" fi 2 (t + 1) 2 

where inequality (i) follows from the facts ^^_ =1 (1 + log t)/t < (1 + logi) 2 , and fi < 2. Conse- 
quently, we may apply Chebyshev's inequality to control the stochastic deviation X^t=i (X T+1 i eT ) /(* + !)• 
More specifically, for 7 > (to be specified) we have 



16 TO) 2 (l + log(i + l)) 2 + 4 



r=0 



/X 4 7 2 (t + 1 

We now combine our earlier bound (|4Up with the tail bound (|4ip . making the specific choice 



4 if (VQ ^(1 + log(t + l)) 2 + 4 
7 " /x 2 ^ + 1 

for a fixed < e < 1, thereby concluding that 



H t+ii|2 < 1 + + 1) 1 V(l + log(^ + l)) 2 + 4 

l|e 1,2 - t + i li 2 ^ t + 1 

with probability at least 1 — e. Simplifying the last bound, we obtain 

t + n,2 < + i+i°g(* + i) 



"" 1,2 " a* 2 V~ ' V^J t + i 

for all i > 1, with probability at least 1 — e. 
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4.3 Proof of Proposition [JJ 



Recall the definition (fT0|) of the probability mass function {p U v(j)}jex used in the update 
of directed edge (u — > v). This probability depends on the current value of the message, so 
we can view it as being generated by a function q uv : M. D — > M. d that performs the mapping 
m ^ {Puv(j)}jex- in terms of this function, we can rewrite the BP message update ([5]) 
on directed edge (u — > v) as F uv (m) = T uv q U v(m), where the renormalized compatibility 
matrix T uv was defined previously ([8]). We now define the D x D block diagonal matrix 
r := blkdiagjrut,}^^^^^, as well as the function q : M> D — > M. D obtained by concatenating 
all of the functions q uv , one for each directed edge. In terms of these quantities, we rewrite 
the global BP message update in the compact form F{m) = V q{m). 

With these preliminaries in place, we now bound the Lipschitz constant of the mapping 
F : M. D — > M. D . Given an arbitrary pair of messages m, m! £ S, we have 

\\F{m) - F{m')\\l = \\T (q(m) - q(m'))\\ 2 2 = \\ T uv{q U v{m) - q uv {m')) |||. (42) 

(u-s-u)e£ 



By the Perron- Frobenius theorem [TT], we know that T uv has a unique maximal eigenvalue 
of 1, achieved for the left eigenvector 1 G l d , where 1 denotes the vector of all ones. Since 
the d-dimensional vectors q U v(fn) and q uv (rn') are both probability distributions, we have 
(1, q uv (m) — q uv {m')) = 0. Therefore, we conclude that 

Zuv fr 

^uv{quv(rn) - q U v{m')) = (T uv - — ) (q uv {m) - q uv {m')), 

where z uv denotes the right eigenvector of T uv corresponding to the eigenvalue one. Combining 
this equality with the representation (|4"2"j) . we find that 

z F 

\\F(m) - F{m')\\l = V \\{T UV -^ — ) (q uv (m) - q uv {m')) \\\ 



-uv 



< max \\T UV -^ — \\q(m) - q(m')\\l. (43) 
(u-+v)e£ 1 z v 



It remains to upper bound the Lipschitz constant of the mapping q : W D — > W D previously 
defined. 

Lemma 3. For all m 7^ m' , we have 

\\q(m) -q(m')\\ 2 ^ ^ ^ m&x ^ WjU ^ (44) 

\\ m - m h (u^v)es (w^u)e£ 

where the quantities &i(u,v), and &2(w,u) were previously defined in (I21ah and (I21bl) . 

As this proof is somewhat technical, we defer it to Appendix |Dj Combining the upper 



bound (|44p with the earlier bound (143 p completes the proof of the proposition. 



5 Experimental Results 

In this section, we present a variety of experimental results that confirm the theoretical 
predictions, and show that SBP is a practical algorithm. We provide results both for simulated 
graphical models, and real-world applications to image denoising and disparity computation. 
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5.1 Simulations on Synthetic Problems 



We start by performing some simulations for the Potts model, in which the edge potentials 
are specified by a parameter 7 S (0, 1], as discussed in Example [TJ The node potentials are 
generated randomly, on the basis of fixed parameters /i > a > satisfying fj, + a < 1, as 
follows: for each u 6 V and label i 7^ 1, we generate an independent random variable Z u .^ 
uniformly distributed on the interval (— 1,+1), and then set 

Mi) = \ 1 ! ■ 




Figure 4. Panels illustrate the normalized squared-error m*|||/||m*||5 versus the number 
of iterations t for a chain of size n = 100 and state dimension d = 64. Each plot contains 10 
different sample paths. Panel (a) corresponds to the coupling parameter 7 = 0.02 whereas 
panel (b) corresponds to 7 = 0.05. In all cases, the SBP algorithm was implemented with step 
size A* = 2/(t + 1), and the node potentials were generated with parameters (/x, a) — (0.1, 0.1). 



For a fixed graph topology and collection of node/edge potentials, we first run BP to 
compute the fixed point m*cj We then run SBP algorithm to find the sequence of messages 
{m*}oo o an( ^ com p U t e the normalized squared error ||m* — ^*||2/ll m *lli- I n cases where the 
mean squared error is reported, we computed it by averaging over 20 different runs of the 
algorithm. (Note that the runs are different, since the SBP algorithm is randomized.) 

In our first set of experiments, we examine the consistency of the SBP on a chain-structured 
graph, as illustrated in Figure QJb), representing a particular instance of a tree. We imple- 
mented the SBP algorithm with step size A* = 2/(t + l), and performed simulations for a chain 
with n = 100 nodes, state dimension d = 64, node potential parameters (j«, a) = (0.1,0.1), 
and for two different choices of edge potential 7 £ {0.02,0.05}. The resulting traces of the 
normalized squared error versus iteration number are plotted in Figure [H each panel contains 
10 different sample paths. These plots confirm the prediction of strong consistency given in 
TheoremQJa) — in particular, the error in each sample path converges to zero. We also observe 
that the typical performance is highly concentrated around its average, as can be observed 
from the small amount of variance in the sample paths. 

10 We stop the BP iterations when |jm t+1 — m'||2 becomes less than 10 -4 . 



24 



10" 



10 



Number of iterations 

(a) 



10 



Number of iterations 



(b) 



Figure 5. Effect of increasing state dimension on convergence rates. Plots of the normalized 
mean squared-error E[||m* — rr** || §] /||w*||| versus the number of iterations for two different 
graphs: (a) chain with n = 100 nodes, and (b) two-dimensional square grid with n = 100 nodes. 
In both panels, each curve corresponds different state dimension d € {128, 256, 512, 1024}. All 
simulations were performed with step sizes A* = 2/(f + 1), and the node/edge parameters were 
generated with parameters (/i, a) = (0.1, 0.1) and 7 = 0.1 respectively. 



Our next set of simulations are designed to study the effect of increasing of the state 
dimension d on convergence rates. We performed simulations both for the chain with n = 
100 nodes, as well as a two-dimensional square grid with n = 100 nodes. In all cases, we 
implemented the SBP algorithm with step sizes A* = 2/(t + 1), and generated the node/edge 
potentials with parameters (/i, a) = (0.1,0.1) and 7 = 0.1 respectively. In Figure El we plot 
the normalized mean-squared error (estimated by averaging over 20 trials) versus the number 
of iterations for the chain in panel (a), and the grid in panel (b). Each panel contains four 
different curves, each corresponding to a choice of state dimension d € {128,256,512,1024}. 
For the given step size, Theorem [2] guarantees that the convergence rate should be l/t a (a < 1) 
with the number of iterations t. In the log- log domain plot, this convergence rate manifests 
itself as a straight line with slope —a. For the chain simulations shown in panel (a), all four 
curves exhibit exactly this behavior, with the only difference with increasing dimension being 
a vertical shift (no change in slope). For the grid simulations in panel (b), problems with 
smaller state dimension exhibit somewhat faster convergence rate than predicted by theory, 
whereas the larger problems (d G {512, 1024}) exhibit linear convergence on the log-log scale. 

As discussed previously, the SBP message updates are less expensive by a factor of d. The 
top two rows of Table [5TT1 show the per iteration running time of both BP and SBP algorithms, 
for different state dimensions as indicated. As predicted by theory, the SBP running time per 
iteration is significantly lower than BP, scaling linearly in d in contrast to the quadratic scaling 
of BP. To be fair in our comparison, we also measured the total computation time required 
for either BP or SBP to converge to the fixed point up to a <5-tolerance, with 5 = 0.01. This 
comparison allows for the fact that BP may take many fewer iterations than SBP to converge 
to an approximate fixed point. Nonetheless, as shown in the bottom two rows of Table l5TT| in 
all cases except one (chain graph with dimension d = 128), we still see significant speed-ups 
from SBP in this overall running time. This gain becomes especially pronounced for larger 
dimensions, where these types of savings are more important. 
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d= 128 


d = 256 


d = 512 


d = 1024 




BP (per iteration) 


0.0700 


0.2844 


2.83 


18.0774 


SBP (per iteration) 


0.0036 


0.0068 


0.0145 


0.0280 


BP (total) 


0.14 


0.57 


5.66 


36.15 


SBP (total) 


0.26 


0.27 


0.29 


0.28 


Grid 


BP (per iteration) 


0.1300 


0.5231 


5.3125 


32.5050 


SBP (per iteration) 


0.0095 


0.0172 


0.0325 


0.0620 


BP (total) 


0.65 


3.66 


10.63 


65.01 


SBP (total) 


0.21 


1.31 


0.65 


0.62 



Table 1. Comparison of BP and SBP computational cost for two different graphs each with 
n = 100 nodes. For each graph type, the top two rows show per iteration running time (in 
seconds) of the BP and SBP algorithms for different state dimensions. The bottom two rows 
show total running time (in seconds) to compute the message fixed point to 5 — 0.01 accuracy. 



5.2 Applications in Image Processing and Computer Vision 

In our next set of experiments, we study the SBP on some larger scale graphs and more 
challenging problem instances, with applications to image processing and computer vision. 
Message-passing algorithms can be used for image denoising, in particular, on a two dimen- 
sional square grid where every node corresponds to a pixel. Running the BP algorithm on the 
graph, one can obtain (approximations to) the most likely value of every pixel based on the 
noisy observations. In this experiment, we consider a 200 x 200 image with d = 256 gray-scale 
levels, as showin in Figure [6^a). We then contaminate every pixel with an independent Gaus- 
sian random variable with standard deviation a = 0.1, as shown in Figure [6^b). Enforcing 
the Potts model with smoothness parameter 7 = 0.05 as the edge potential, we run BP and 
SBP for the total of t = 5 and t = 100 iterations respectively to obtain the refined images 
(see panels (c) and (d), respectively, in Figure [6|). Figure [7] illustrates the mean squared error 
versus the running time for both BP and SBP denoising. As one can observe, despite smaller 
jumps in the error reduction, the per-iteration running time of SBP is substantially lower 
than BP. Overall, SBP has done a marginally better job than BP in a substantially shorter 
amount of time in this instanced 

Finally, in our last experiment, we apply SBP to a computer vision problem. Graphical 
models and message-passing algorithms are popular in application to the stereo vision prob- 
lem \28\ 116]. in which the goal is to estimate objects depth based on the pixel dissimilarities 
in two (left and right view) images. Adopting the original model in Sun et al. [28], we again 
use a form of the Potts model in order to enforce a smoothness prior, and also use the form of 
the observation potentials given in the Sun et al. paper. We then run BP and SBP (with step 
size 3/(i + 2)) for a total of t = 10 and t = 50 iterations respectively in order to estimate the 
pixel dissimilarities. The results for the test image "map" are presented in Figure El Here, the 
maximum pixel dissimilarity is d = 32, which makes stereo vision a relatively low-dimensional 
problem. In this particular application, the SBP is faster by about a factor of 3 — 4 times per 
iteration; however, the need to run more iterations makes it comparable to BP. This is to be 
expected since the state dimension d = 32 is relatively small, and the relative advantage of 
SBP becomes more significant for larger state dimensions d. 

11 Note that the purpose of this experiment is not to analyze the potential of SBP (or for that matter BP) 
in image denoising, but to rather observe their relative performances and computational complexities. 
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Figure 6. Image denoising application, (a) original image, (b) noisy image, (c) refined image 
obtained from BP after t = 5 iterations, and (d) refined image obtained from SBP after t = 100 
iterations. The image is 200 x 200 with d = 256 gray-scale levels. The SBP step size, the Potts 
model parameter, and noise standard deviation are set to A* = l/(t + 1), 7 = 0.05 and a = 0.1 
respectively. 

6 Discussion 

In this paper, we have developed and analyzed a new and low-complexity alternative to BP 
message-passing. The SBP algorithm has per iteration computational complexity that scales 
linearly in the state dimension d, as opposed to the quadratic dependence of BP, and a 
communication cost of log d bits per edge and iteration, as opposed to d — 1 real numbers 
for standard BP message updates. Stochastic belief propagation is also easy to implement, 
requiring only random number generation and the usual distributed updates of a message- 
passing algorithm. Our main contribution was to prove a number of theoretical guarantees for 
the SBP message updates, including convergence for any tree-structured problem, as well as 
for general graphs for which the ordinary BP message update satisfies a suitable contraction 
condition. In addition, we provided non-asymptotic upper bounds on the SBP error, both in 
expectation and in high probability. 

The results described here suggest a number of directions for future research. First, the 
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Figure 7. Mean squared error versus the running time (in seconds) for both BP and SBP 
image denoising. The simulations are performed with the step size A* = l/(i + 1), and the 
Potts model parameter 7 = 0.05 on a 200 x 200 image with d = 256 gray-scale levels. The noise 
is assumed to be additive, independent Gaussian random variables with standard deviation 



ideas exploited here have natural generalizations to problems involving continuous random 
variables and also other algorithms that operate over the sum-product semi-ring, including 
the generalized belief propagation algorithm [33] as well as reweighted sum-product algo- 
rithms [31]. More generally, the BP can be seen as optimizing the dual of the Bethe free 
energy function [33J, and it would be interesting to see if SBP can be interpreted as a stochas- 
tic version of this Bethe free energy minimization. It is also natural to consider whether similar 
ideas can be applied to analyze stochastic forms of message-passing over other semi-rings, such 
as the max-product algebra that underlies the computation of maximum a posteriori (MAP) 
configurations in graphical models. In this paper, we have developed SBP for applications to 
Markov random fields with pairwise interactions. In principle, any undirected graphical model 
with discrete variables can be reduced to this form [331 132] : however, in certain applications, 
such as decoding of LDPC codes over non-binary state spaces, this could be cumbersome. For 
such cases, it would be useful to derive a variant of SBP that applies directly to factor graphs 
with higher-order interactions. Finally, our analysis for general graphs has been done under 
a contractivity condition, but it is likely that this requirement could be loosened. Indeed, the 
SBP algorithm works well for many problems where this condition need not be satisfied. 
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a = 0.1. 
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Figure 8. Stereo vision, depth recognition, application, (a) reference image, (b) ground truth, 
(c) BP estimate after t = 10 iterations, and (d) SBP estimate after t — 50 iterations. The 
algorithms are applied to the standard "map" image with maximum pixel dissimilarity d = 32. 
The SBP step size is set to A* = 3/(i + 2). 



A Details of Example [T] 

In this appendix, we verify the sufficient condition for contractivity (I24D . Recall the defini- 
tion ()14p of the zero'th order bounds. By construction, we have the relations 



R uv (i) = Rl v (i) 



7 



-, and 



l + (d-l) 7 ' 
o 1 

B M v(i) = B uv {i) = Y^Tu — fj - ^ or an * e ^ an d («—>•«)€ £■ 



Substituting these bounds into the definitions (|22aj) and (|22b|) and doing some simple algebra 
yields the upper bounds 



£Af(u)\{v,w} Bsu(j) 



< max 



l + (d-l)7 



max 



< max 



Puv{j) HseM-(u)\v B su(j) 



max 



l + (d-l)7 



7 



Pu 



max 



and 



where we have denoted the degree of the node u by p u . Substituting these inequalities into 
expression (|23p and noting that 7 < 1, we find that the global update function has Lipschitz 
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constant at most 



L < 4(1 _ 7)(1 + ((i _ 1)7) n S {^0- ff ?{^} 2 }- 

as claimed. 



B Proof of Lemma [T] 

By construction, for each directed edge (it — > v), the message vector m uv belongs to the 
probability simplex — that is, Ylizx m uv(i) = 1> and m uv y 0. From equation ([25]) . the vector 
m uv is a convex combination of the columns of the matrix V. Recalling bounds (|14p . we 
conclude that the message vector must belong to the set S, as defined in equation (fT9|) . in 



particular with B_ uv (i) = B® uv (i) and B uv (i) = B uv (i). Note that the set S is compact, and 
any member of it has strictly positive elements under our assumptions. 

For directed edges (u — > v) and (w — > s), let G M dxd denote the Jacobian matrix 

obtained from taking the partial derivative of the update function F uv with respect to the mes- 
sage vector m ws . By inspection, the function F uv is continuously differentiable; consequently, 
the function 9 ^ is continuous, and hence must achieve its supremum over the compact 

set S. Consequently, we may use these Jacobian matrices to define a matrix A UVjWS £ M. dxd 
with entries 



A U v,ws{ii j) '■ — max 



dF uv (i;m) 



dm ws (j) 



for i,j = l,...,d. 



We then use these matrices to define a larger matrix A 6 M. DxD , consisting of 2\£\ x 2\£\ sub- 
blocks each of size d x d, with the sub-blocks indexed by pairs of directed edges (u — > v) £ £. 
In particular, the matrix A UVjWS occupies the sub-block indexed by the edge pair (u — > v) and 
(w — > s). Note that by the structure of the update function F, the matrix A UV)WS can be 
non-zero only if s = u and w £ Af(u)\{v}. 

Now let \7F G M. DxD denote the Jacobian matrix of the update function F. By the 
integral form of the mean value theorem, we have the representation 



F(m) - F{rri) 



I X7F(m' + T{m-m'))dr 
Jo 



(m — m'). 



Applying triangle inequality separately to each component of this D-vector and then using 
the definition of A, we obtain the elementwise upper bound 

\F(m) — F(m')\ X A \m — m\. 

It remains to show that A is nilpotent: more precisely, we show that A r is the all-zero 
matrix, where r = diam(^) denotes the diameter of the graph Q. In order to do so, we first 
let B G 1 £" | x 2 1 £: | k e the "block indicator" matrix — that is, its entries are given by 



B(u — )■ v, w 



1 if A uv ws ^ 
otherwise. 
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Based on this definition, it is straightforward to verify that if B r = for some positive inte- 
ger r, then we also have A r = 0. Consequently, it suffices to show that B r = for r = diam(^). 

Fix a pair of directed edges (u — > v) and (w — > s), and some integer £ > 1. We first 
claim that the matrix entry B*(u — > v,w — > s) is non-zero only if there exists a directed 
path of length I + 1 from it; to u that includes both s and it, meaning that there exist nodes 
si, S2, ■ ■ ■ , S£_2 such that 

w£j\f(s)\si, si G 7V(s2) \ S3, . . . , and s/_ 2 e^(a)\i). 

We prove this claim via induction. The base case £ = 1 is true by construction. Now supposing 
that the claim holds at order £, we show that it must hold at order £ + 1. By definition of 
matrix multiplication, we have 

B e+1 (u ->■ it; — > s) = i^(u — >• «, x — >• y) -B(x — >■ y, u> — >• s). 

In order for this entry to be non-zero, there must exist a directed edge (x —> y) that forms 
a {£ + redirected path to (ii — > v), and moreover, we must have s = x, and w G M(x) \ y. 
These conditions are equivalent of having a directed path of length £ + 2 from w to jj, with s 
and u as intermediate nodes, thereby completing the proof of our intermediate claim. 

Finally, we observe that in a tree-structured graph, there can be no directed path of length 
greater than r = diam(^). Consequently, our intermediate claim implies that B r = for any 
tree-structured graph, which completes the proof. 



C Proof of Lemma [2] 

Noting that it is equivalent to bound the logarithm, we have 

t+2 , v t+2 , N t+2 

where we used the fact that log(l— x) < —x for x G (0, 1). Since the function 1/x is decreasing, 
we have 

*+ 2 i rt+3 y 

V - > / -dx = log(t + 3) - log(i + l). (46) 

Substituting inequality ggj into (05]) yields log Ed±? +1 (l - f ) < a ( log(i + 1) - log(t + 3)) , 
from which the claim stated in the lemma follows. 



D Proof of Lemma [3] 

Let \7q(m) G ~K DxD denote the Jacobian matrix of the function q : W D — > R-° evaluated at m. 
Since q is differentiable, we can apply the integral form of the mean value theorem to write 
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q(m) — q{m') = [ Vq(m' + r(m — m')) dr] (m — m'). From this representation, we obtain 
the upper bound 



/(to) - q(m')\\ 2 < 



\\Vq{m' + \{m-m'))\\ 2 d\ 



(TO-m')||2 < sup ||Vg(m)|2 \\m - m'\\ 2 , 



showing that it suffices to control the quantity sup mg5 ||| Vg(m)|||2. 

Let 9 Q^} m ^ be the d x d matrix of partial derivatives of the function q uv : MP — > R d 
obtained from taking the partial derivatives with respect to the message vector m ws S M. d . 
We then define a 2\£\ x 2 \£\ -dimensional matrix A with the entries 

A(u^v,w^s):= | SUPmeS 1 %S? i2 ifs = U ' ^weM(u)\{v} ^ 
1 otherwise. 

Our next step is to show that sup me5 ||| Vg(T7i)||| 2 < |||-4|||2- Let y = {y U v}r u ^ v \ e g be an 
arbitrary D-dimensional vector, where each sub- vector y uv is an element of M. d . By exploiting 
the structure of Vq{m) and y, we have 



II V<z(to) ylli = E II E 



dq uv (m) 2 

ywu\\2 



(u-tv)eS weN(u)\{v} 

* E ( E w-^r Vwuh 



(u^v)ee weAf{u)\{v} 

dfn wu 



( 5 / m dq uv (m) 

^ E E l-^T— 



e£ io&/V(«)\M 

(iii) / \ 2 

< 2^ ( 2^ u >™ ~^ u )\\yu>v.h ) > 

(«->i>)g£ u>eAf(«)\M 

where the bound (i) follows by triangle inequality; the bound (ii) follows from definition of 
the operator norm; and the final inequality (iii) follows by definition of A. 

Defining the vector z 6 M 2 ^! with the entries z wu = \\y wu \\2, we have established the upper 
bound \\Vq(m) < ||Az|| 2 , and hence that 

||Vg(TO)y|| 2 < \\A\f 2 \\z\\ 2 2 = |||A||||||y||| 5 

where the final equality uses the fact that ||y|| 2 = ||2;||| by construction. Since both the mes- 
sage to and vector y were arbitrary, we have shown that sup mg5 ||| V<7(m)|2 < |||-<4|||2> as claimed. 

Our final step is to control the quantities sup mg5 II 9 &r}™^ III 2 * na ^ define the entries of A. 
In this argument, we make repeated use of the elementary matrix inequality [11] 



\\B\g i f .max^El^il) ( mK^l^l), M8) 
j=i ' ^ ' i=i 



|B|„ IIIBII 

valid for any n x n matrix. 
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Recall the definition of the probability distribution (jlOp that defines the function q uv : 
as well as our shorthand notation M uv (x u ) = Yl we j\ff u )\{v} m wu( x u)- Taking the derivatives 
and performing some algebra yields 

dq uv (i ; m) _ dq uv (i ; m) dM uv (k) 
dm wu (j) j-^ dM uv {k) dm wu (j) 

_ dq uv (i ; m) M uv (j) 
dM uv (j) (i) 
-/3 UV (t) M ut , (t) /3 W (i) M ra ( j) 



' {Yl=iPuv{k)M uv {k)f m wu (j)' 
for i ^ j, and w G AA(u)\{t>}. For i = j, we have 
; m) o>g w (i ; m) M uv (i) 



dm wu {i) dM uv (i) 



Puv(i) Puv(i) 2 M uv {i) 



. £*=1 Aw WJMw (fc) ( =1 (3 UV {k)M uv (k))' 
Putting together the pieces leads to the upper bounds 



^^ }l < 2ma J WJ)M) 2_L and 



M uv (i) 



dm ^u ie* I Yl=i Puv(k) M uv {k) m wu (j) 

dq uv (m) ... 



dm wu iex \Yr k=1 p uv {k)M uv {k) m wu (i) (Yl=i Puv{k) M uv (k)) 2 fr[ m wu{j) 

Recalling the definitions (|22a|) and (|22bp of <p U v,wu and Xuv,wu respectively, we find that 
hi dg uv {m) , I,, dq uv {m) 

III |||l ^ Yuv,wui ana ||| |||oo (Puv,wu ~r Xuv,wu- 

uvn wu OTft wu 
Thus, by applying inequality (USD with = , we conclude that 

hi dq uv {m) 2 



2^2 4>uv,wu (fiuv^u Xuv,wu) 



Since this bound holds for any message m £ 5, we conclude that each of the matrix entries 
A(u — > v, w — > u) satisfies the same inequality. Again applying the basic matrix inequal- 
ity (f48l) . this time with B = A, we conclude that |||A|||2 is upper bounded by 



2 maX ^ ^ (jPuVjWU {(ftuVjWU ~t~ XuV ,Wu)) 2 maX ^ ^ i^UV^WU {.^UV^WU ~i~ ~XuV,U]u)) 2 5 

which concludes the proof. 
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